Enhancing Clustering Mechanism by Customised Expectation–Maximization Algorithm: A Review
Author
Abstract
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. Challenges include analysis, data curation, search, sharing, storage, transfer, visualization, querying, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and rarely to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk. Data mining is the central step in a process called knowledge discovery in databases, namely the step in which modeling techniques are applied. Research areas such as artificial intelligence, machine learning, and soft computing have contributed to its arsenal of methods. In our opinion, fuzzy approaches can play an important role in data mining because they yield comprehensible results, a goal that is sometimes hard to achieve with other methods. In addition, the approaches studied in data mining have mainly been oriented towards highly structured and precise data. However, we expect that the analysis of more complex, heterogeneous information sources such as texts, images, and rule bases will become more important in the near future. Therefore we give an outlook on information mining, which we see as an extension of data mining that treats difficult data in heterogeneous information sources, and we argue that fuzzy systems are useful in meeting the challenges of information mining. Soft computing is the use of inexact solutions to computationally hard tasks, such as NP-complete problems, for which no known algorithm can compute an exact solution in polynomial time. The process of knowledge discovery in databases, often also called data mining, is the first important step in knowledge management technology. The end users of these tools and systems are found at all levels of management, from operative workers to managers, and it is their demands on the processing and analysis of data and information that drive the development of these tools.
Keywords: Data mining, web mining, web intelligence, knowledge discovery, fuzzy logic, K-means
[1] Introduction
The analysis of large data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on." Scientists, business executives, practitioners of medicine, advertising, and governments alike regularly meet difficulties with large data sets in areas including web search, finance, and business informatics. Scientists encounter limitations in fields including meteorology, genomics, connectomics, complex physics simulations, biology, and environmental research. Data sets are growing rapidly in part because they are increasingly gathered by cheap and numerous information-sensing mobile devices, aerial sensing, software logs, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, about 2.5 exabytes of data are created every day. Relational database management systems and desktop statistics packages often have difficulty handling big data; the work instead requires parallel software running on tens or even thousands of servers.
What is considered "big data" varies depending on the capabilities of users and their tools, and expanding capabilities make big data a moving target. For some businesses, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options; for others, it may take tens or hundreds of terabytes before data size becomes an important consideration.
Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, statistics, and database systems. The overall goal of data mining is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, this involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction of data itself. It is also a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, analysis, and statistics) as well as to computer decision support systems, including artificial intelligence, machine learning, and business intelligence. The book "Data Mining: Practical Machine Learning Tools and Techniques with Java" was originally to be named just "Practical Machine Learning", and the term "data mining" was added only for marketing reasons. Often the more general term data analysis, or, when referring to actual methods, artificial intelligence and machine learning, is more appropriate. The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies. This generally involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data and may be used in further analysis or, for example, in machine learning and predictive analytics.
[2] Motivation & Problem statement
Web Intelligence [2]: Google Analytics is a service offered by Google that generates detailed statistics about a website's traffic and traffic sources and measures conversions and sales. It is the most widely used website statistics service. The basic service is free of charge and a premium version is available for a fee. Google Analytics can track visitors from all referrers, including search engines, direct visits, and referring sites. It also tracks email marketing and digital collateral such as links within PDF documents. Integrated with AdWords, users can review online campaigns by landing page quality and conversions (goals). Goals might include sales, viewing a specific page, or downloading data. The Google Analytics approach is to show high-level, dashboard-type information for the casual user and more in-depth information further into the report set. Google Analytics analysis can identify poorly performing pages with techniques such as funnel visualization, and can show where visitors came from, how long they stayed, and their geographical position. It also provides more advanced features, including custom visitor segmentation. Google Analytics e-commerce reporting can track sales activity and performance.
E-commerce reports show a site's transactions, revenue, and many other commerce-related metrics. Dashboards give a summary of many reports on a single page. Start a dashboard with the most important performance indicators, then create detailed dashboards for other specific topics such as search engine optimization. Dashboards use drag-and-drop widgets for fast, easy customization.
A challenging problem in Web Intelligence is how to deal with the uncertainty of information on the wired and wireless Web. Adapting existing soft computing solutions, where appropriate for WI applications, requires a robust notion of learning that scales to the Web, adapts to individual user requirements, and personalizes interfaces. Ongoing efforts exist to integrate logic, artificial neural networks, probabilistic and statistical reasoning, fuzzy sets, rough sets, granular computing, genetic algorithms, and other methodologies of the soft computing paradigm in order to construct hybrid systems for Web Intelligence at the following levels:
1. Internet-level communication, infrastructure, and security protocols. The Web is regarded as a computer-network system. WI techniques for this level include Web prefetching systems built upon Web surfing patterns to resolve the issue of Web latency. The intelligence of Web prefetching comes from an adaptive learning process based on the observation and characterization of user surfing behaviour.
2. Interface-level multimedia presentation standards. The Web is regarded as an interface for human-Internet interaction. WI techniques for this level are used to develop intelligent Web interfaces in which capabilities of adaptive cross-language processing, personalized multimedia representation, and multimodal information processing are required.
3. Knowledge-level information processing and management tools. The Web is regarded as a distributed data/knowledge base. We need to develop semantic markup languages that represent the semantic contents of the Web in machine-understandable formats for agent-based autonomic computing, such as searching, aggregation, classification, filtering, managing, mining, and discovery on the Web.
[3] Survey of earlier work
The use of data mining techniques in manufacturing began in the 1990s and has gradually gained attention from the production community. These techniques are now used in many different areas of manufacturing engineering to extract knowledge for use in predictive maintenance, fault detection, design, production, quality assurance, scheduling, and decision support systems. Data can be analyzed to identify hidden patterns in the parameters that control manufacturing processes, or to determine and improve the quality of products. A major advantage of data mining is that the data required for analysis can be collected during the normal operation of the manufacturing process being studied, so it is generally not necessary to introduce dedicated processes for data collection. Since the importance of data mining in manufacturing has clearly increased over the last 20 years, it is now appropriate to critically review its history and application. Data mining techniques have become a basic element of modern business. Although the idea is not new, new technologies and implemented standards have contributed to their growing popularity. With regard to mining model usage, SQL Server 2005 marked a breakthrough in this area.
Thanks to the DMX language, both programmers and database administrators are able to create data mining systems in a simple way. Although economic and business publications are rich in data mining approaches, the problem described here is covered rather weakly in international publications. Nevertheless, some industrial applications of data mining technology were considered in (Duebel, C., 2003). Industrial use of data mining techniques opens new possibilities in decision making, not only for top-level management but also for advisory or control systems. The implementation of prediction, classification, or even anomaly detection algorithms may become a lucrative tool for optimizing appropriate stages of an industrial process, combining diagnosis and control functions. The reviewed literature shows rapid growth in the application of data mining in industry and manufacturing. However, adoption of this technology is still slow in some industries for several reasons, including the difficulty of determining the type of data mining function to be performed in a particular knowledge area and the question of choosing the most appropriate data mining technique from among many possibilities. Waldemar Wójcik and Konrad Gromaszek (Lublin University of Technology, Poland) introduced "Data Mining Industrial Applications". Data mining is a blend of concepts and algorithms from machine learning, statistics, artificial intelligence, and data management. With the emergence of data mining, researchers and practitioners began applying this technology to data from different areas such as banking, finance, retail, marketing, insurance, fraud detection, science, engineering, etc., to discover any hidden relationships or patterns. Jiawei Han and Jing Gao of the University of Illinois at Urbana-Champaign wrote the paper "Research Challenges for Data Mining in Science and Engineering". With the rapid development of computer and information technology over the last several decades, an enormous amount of data in science and engineering has been, and will continue to be, generated on a massive scale, either stored in gigantic storage devices or flowing into and out of systems in the form of data streams. Moreover, such data has been made widely available, e.g., via the Internet. Such a tremendous amount of data, on the order of tera- to peta-bytes, has fundamentally changed science and engineering, transforming many disciplines from data-poor to increasingly data-rich and calling for new, data-intensive methods of conducting research. In that paper, they discuss research challenges in science and engineering from a data mining perspective.
[4] Tools & technology used
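As a purely illustrative sketch, not the authors' actual tooling, the following contrasts the two clustering techniques named in this review, K-means and Expectation-Maximization for a Gaussian mixture, on synthetic data; scikit-learn is an assumed library choice.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data with three groups of unequal spread.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=[0.5, 1.0, 2.0], random_state=42)

# K-means: hard assignments, implicitly spherical clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Expectation-Maximization for a Gaussian mixture: soft assignments,
# one full covariance matrix per component.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                                # alternates E-steps and M-steps until convergence
em_labels = gmm.predict(X)                # hard labels derived from the responsibilities
responsibilities = gmm.predict_proba(X)   # per-point membership probabilities

print("Mixture weights:", np.round(gmm.weights_, 3))
print("Component means:\n", np.round(gmm.means_, 2))

Unlike K-means, the EM fit keeps the per-point responsibilities, which is what allows the customized variants surveyed in this review to trade off hard and soft assignments.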
Similar Resources
Pre Processing Techniques for Arabic Documents Clustering
Clustering of text documents is an important technique for document retrieval. It aims to organize documents into meaningful groups or clusters. Text preprocessing plays a main role in enhancing the clustering process of Arabic documents. This research examines and compares text preprocessing techniques in Arabic document clustering. It also studies the effectiveness of text preprocessing techniques: ...
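The abstract above is truncated, so the following is only an illustration of a typical preprocess-then-cluster pipeline for documents; the tools (scikit-learn's TfidfVectorizer and KMeans) and the toy English corpus are assumptions, not those of the cited work.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy English corpus; real Arabic documents would additionally need
# normalization, stop-word removal, and stemming, as the cited work discusses.
documents = [
    "clustering organizes documents into meaningful groups",
    "text preprocessing improves document clustering quality",
    "wireless sensor networks conserve node energy",
    "energy efficient routing extends network lifetime",
]

# TF-IDF turns each preprocessed document into a weighted term vector.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(documents)

# Cluster the vectors; k=2 is chosen by hand for this toy example.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(list(zip(labels.tolist(), documents)))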
Bayesian K-Means as a "Maximization-Expectation" Algorithm
We introduce a new class of “maximization expectation” (ME) algorithms where we maximize over hidden variables but marginalize over random parameters. This reverses the roles of expectation and maximization in the classical EM algorithm. In the context of clustering, we argue that these hard assignments open the door to very fast implementations based on data-structures such as kdtrees and cong...
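For contrast with the ME idea described above, the short sketch below implements plain hard-assignment (classification) EM for spherical Gaussians: it also maximizes over the hidden cluster assignments, but, unlike the Bayesian K-means/ME algorithm of the cited paper, it keeps point estimates of the means instead of marginalizing over the parameters. It is an illustration only, not the cited method.

import numpy as np

def hard_em_spherical(X, k, n_iter=50, seed=0):
    # Classification (hard-assignment) EM for spherical Gaussians with equal
    # variances, which reduces to K-means. The means are point estimates here,
    # not marginalized as in the ME algorithm above.
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Maximize over the hidden assignments: nearest mean wins.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        z = d2.argmin(axis=1)
        # Update the point estimate of each mean from its assigned points.
        for j in range(k):
            if np.any(z == j):
                means[j] = X[z == j].mean(axis=0)
    return z, means

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
labels, centers = hard_em_spherical(X, k=2)
print(np.round(centers, 2))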
Energy Efficient Clustering Based on Expectation Maximization for Homogeneous Wireless Sensor Network
A WSN comprises a large number of nodes deployed over an area that operate together to perform a task. The nodes have sensing, computing, and communication features. Since the nodes are constrained in power, energy, and computation, it is necessary to use these resources efficiently. The main focus is on enhancing the energy efficiency of the network and improving its lifetime. This paper includes a new metho...
On Initialization of the Expectation-Maximization Clustering Algorithm
Iterative clustering algorithms commonly do not lead to optimal cluster solutions. Partitions that are generated by these algorithms are known to be sensitive to the initial partitions that are fed as an input parameter. A “good” selection of initial partitions is an essential clustering problem. In this paper we introduce a new method for constructing the initial partitions set to be used by t...
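The cited paper's own initialization method is not described in the truncated abstract above, so the sketch below only illustrates the general point that EM benefits from a deliberate starting partition: a k-means++-style seeding (an assumed, standard heuristic) is computed and passed to scikit-learn's GaussianMixture through its means_init parameter.

import numpy as np
from sklearn.mixture import GaussianMixture

def kmeans_pp_seeds(X, k, seed=0):
    # k-means++-style seeding: each new seed is drawn with probability
    # proportional to its squared distance to the nearest existing seed.
    # This is a standard heuristic, not the method proposed in the cited paper.
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - s) ** 2).sum(axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(seeds)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 1.0, (150, 2)) for c in (0.0, 4.0, 8.0)])
init_means = kmeans_pp_seeds(X, k=3)
# Hand the seeds to EM so that its starting partition is not purely random.
gmm = GaussianMixture(n_components=3, means_init=init_means, random_state=0).fit(X)
print(np.round(gmm.means_, 2))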
A survey of model-based clustering algorithms for sequential data
Clustering is a fundamental and widely applied method in understanding and exploring a data set. Interest in clustering has increased recently due to the emergence of several new areas of applications including data mining, bioinformatics, web use data analysis, image analysis and so on. Model-based clustering is one of the most important and widely used clustering methods. This paper presents ...
Scaling-Up Model-Based Clustering Algorithm by Working on Clustering Features
In this paper, we propose EMACF (Expectation-Maximization Algorithm for Clustering Features) to generate clusters from data summaries rather than data items directly. Incorporating with an adaptive grid-based data summarization procedure, we establish a scalable clustering algorithm: gEMACF. The experimental results show that gEMACF can generate more accurate results than other scalable cluster...
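EMACF itself is not reproduced here; the sketch below only illustrates the underlying idea of clustering data summaries (clustering features) instead of raw items, using a simple grid summarization followed by a weighted K-means from scikit-learn as a stand-in for the EM step.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(20000, 2))
               for c in ([0, 0], [6, 1], [3, 6])])

# Grid-based summarization: replace raw points with cell centroids and counts,
# a crude stand-in for the clustering features consumed by EMACF.
cell = 1.0
keys = np.floor(X / cell).astype(int)
uniq, inv, counts = np.unique(keys, axis=0, return_inverse=True, return_counts=True)
inv = inv.ravel()
centroids = np.zeros((len(uniq), 2))
np.add.at(centroids, inv, X)          # sum the points falling in each cell
centroids /= counts[:, None]          # turn sums into per-cell centroids

# Cluster the much smaller summary set, weighting each centroid by its count.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(centroids, sample_weight=counts)
print(len(X), "points summarized into", len(uniq), "clustering features")
print("Cluster centers:\n", np.round(km.cluster_centers_, 2))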